Improved: Motif clustering #39

renewiegandt · 2019-01-05T16:20:08Z

Improved motif clustering by comparing the motifs of each cluster separately with the merged motif file.
Added a new R-script which labels the TSV-files with the corresponding cluster ID.

HendrikSchultheis

I found some things you should change, but nothing major. There are also some spelling errors that should be fixed; you can use the hunspell package to check your scripts for typos.

HendrikSchultheis · 2019-01-05T19:55:16Z

bin/2.2_motif_estimation/bed_to_fasta.R

@@ -1,13 +1,13 @@
 #!/usr/bin/env Rscript
-library("optparse")
+if (!require(optparse)) install.packages("optparse"); library(optparse)

 option_list <- list(
  make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),


cluster_id -> cluster id

HendrikSchultheis · 2019-01-05T19:55:52Z

bin/2.2_motif_estimation/bed_to_fasta.R


 option_list <- list(
  make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
  make_option(opt_str = c("-p", "--prefix"), default = "" , help = "Prefix for file names. Default = '%default'", metavar = "character"),
  make_option(opt_str = c("-m", "--min_seq"), default = 100, help = "Minimum amount of sequences in clusters. Default = %default", metavar = "integer")
 )

-opt_parser <- OptionParser(option_list = option_list, 
+opt_parser <- OptionParser(option_list = option_list,
                           description = "Convert BED-file to one FASTA-file per cluster")


The description should end with a period: "...cluster."

Author and email are missing.

HendrikSchultheis · 2019-01-05T19:56:47Z

bin/2.2_motif_estimation/bed_to_fasta.R


 option_list <- list(
  make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
  make_option(opt_str = c("-p", "--prefix"), default = "" , help = "Prefix for file names. Default = '%default'", metavar = "character"),
  make_option(opt_str = c("-m", "--min_seq"), default = 100, help = "Minimum amount of sequences in clusters. Default = %default", metavar = "integer")
 )

-opt_parser <- OptionParser(option_list = option_list, 
+opt_parser <- OptionParser(option_list = option_list,
                           description = "Convert BED-file to one FASTA-file per cluster")

 opt <- parse_args(opt_parser)


The sequences of each cluster are written as a FASTA-file.

HendrikSchultheis · 2019-01-05T19:59:15Z

bin/2.2_motif_estimation/bed_to_fasta.R

  if (is.null(bedInput)) {
    stop("ERROR: Input parameter cannot be null! Please specify the input parameter.")
  }

  bed <- data.table::fread(bedInput, sep = "\t")
-
+
  # Get last column of data.table, which refers to the cluster, as a vector.
  cluster_no <- as.vector(bed[[ncol(bed)]])


You can remove as.vector. Indexing with [[ ]] already returns a vector.
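
A quick sketch of why the cast is redundant, using a small made-up data.frame in place of the real BED input (the column names and values are placeholders; a data.table should behave the same way for [[ ]], since it inherits from data.frame):

```r
# Hypothetical stand-in for the BED table; the last column is the cluster id.
bed <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  sequence = c("ACGT", "TTGA", "GGCA"),
  cluster = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Current version: extract the last column and cast it to a vector.
with_cast <- as.vector(bed[[ncol(bed)]])

# Suggested version: [[ ]] alone already yields a plain vector.
without_cast <- bed[[ncol(bed)]]

print(is.vector(without_cast))
print(identical(with_cast, without_cast))
```

This holds for plain integer or character columns. as.vector would only change the result for columns that carry attributes, such as factors.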

HendrikSchultheis · 2019-01-05T20:03:39Z

bin/2.2_motif_estimation/bed_to_fasta.R

  # Split data.table bed on its last column (cluster_no) into list of data.frames
  clusters <- split(bed, cluster_no, sorted = TRUE, flatten = FALSE)
-  
+
  # For each data.frame(cluster) in list clusters:
  discard <- lapply(1:length(clusters), function(i){


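It's nicer to use seq_len(length(clusters)) or seq_along(clusters) instead of 1:length(clusters). 1:x counts down to c(1, 0) when x is 0, so the loop body would run on indices that do not exist.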
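
To illustrate the difference, a minimal sketch with an empty list standing in for clusters (the empty case is an assumption for illustration, not something observed in this script):

```r
# Hypothetical edge case: no clusters at all.
clusters <- list()

# 1:length(x) counts down from 1 to 0 when x is empty.
from_colon <- 1:length(clusters)

# seq_len() and seq_along() return an empty index vector instead.
from_seq_len <- seq_len(length(clusters))
from_seq_along <- seq_along(clusters)

# Count how often a loop body would run with the safe index vector.
visited <- 0L
for (i in from_seq_len) visited <- visited + 1L

print(from_colon)
print(length(from_seq_len))
print(visited)
```

With seq_len or seq_along the lapply call simply returns an empty list for empty input, instead of trying to access clusters[[1]] and clusters[[0]].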

HendrikSchultheis · 2019-01-05T23:49:14Z